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Abstract. Nonnegative matrix factorization has been widely applied in face recognition, text mining, as well as spectral 
analysis. This paper proposes an alternating proximal gradient method for solving this problem. With a uniformly positive lower 
bound assumption on the iterates, any limit point can be proved to satisfy the first-order optimality conditions. A Nesterov-type 
extrapolation technique is then applied to accelerate the algorithm. Though this technique is at first used for convex program, 
it turns out to work very well for the non-convex nonnegative matrix factorization problem. Extensive numerical experiments 
illustrate the efficiency of the alternating proximal gradient method and the accleration technique. Especially for real data 
tests, the accelerated method reveals high superiority to state-of-the-art algorithms in speed with comparable solution qualities. 

1. Introduction. Given a nonnegative matrix M, nonnegative matrix factorization (NMF) aims to find 
a pair of nonnegative factors X, Y such that M is approximated by the product of X and Y. Mathematically, 
this problem can be modeled as 

mmF{X,Y)=l\\XY - M\\l, subject to X € K™^^ F e M;""", (1.1) 

where M is the given m x n nonnegative matrix and r (usually r <^ min{r7i,n}) is pre-determined. It 
appears that NMF was first proposed and used by Paatero and his coworkers in areas of environmental 
science [17]. Unfortunately, they used positive matrix factorization" in their paper though they actually 
considered ^^nonnegative matrix factorization". The later popularity of NMF attributes a great deal to the 
publication of [10] in nature. Since then, NMF has been widely applied in data mining such as text mining 
[19], image mining [12], dimension reduction and clustering [3, 22], as well as spectral data analysis [18]. 
More information on NMF can be found in the survey paper [2], books [4, 5] and the references thereof. 
Recently, some variants of NMF were proposed in order to better fit applications, such as semi-NMF [6] 
and convex-NMF [20] . Discussions of these models have been out of the scope of this paper. We refer the 
interested readers to the papers where they are proposed and used. 

1.1. Existing algorithms for NMF. Since problem (1.1) is non-convex and it is proved NP-hard, no 
algorithm can be found to exactly solve the problem within polynomial time unless P=NP. Two early widely 
used algorithms for NMF are projected alternating least squares method [17] and multiplicative updating 
method [11]. The former algorithm alternatively updates X and Y by minimizing \\XY-M\\l, with respect 
to X and Y without the nonnegativity constraint and then projecting the solution to the nonnegative matrix 
cone. The closed-form updates are given as 

x'' = 7'+(Af(y'=-i)^(y'=-i(y'=-i)^)t), 
y'' = v+{iix'')'^xy{x'')'^M), 

where {V+{Z))ij = max(0, Z^j) and f denotes pseudo-inverse. The latter algorithm has much cheaper 
multiplicative updates 

{X% = iX''-%{M{Y'^-Y).,,/{X'^-'Y'-\Y''-Y + eh, V*,j, 
{Y% = {Y''-%{{X'^yMh/{{X'^y{X'^)Y''-'+eh, V*,j 

where £ > is used to avoid zero division. This method is simple to implement and often yields good results. 
However, it still lacks of convergence result, and it is quite sensitive to initial values. Once one component 
of X or y becomes zero at some iteration, it will stay zero during the whole process of the method. 

Despite of joint non-convexity, the objective function is convex with respect to each variable by fixing 
the other. Based on this observation, a series of alternating nonnegative least square methods (ANLS) were 
proposed such as projected gradient method [13], active set method [8] and block principal pivoting method 
[9]. These methods alternatively minimize the objective function F with respect to X and Y, one at a time 
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while fixing the other at its most recent value. Both X-subproblem and K-subproblem can be written as the 
following nonnegative least square problem 

mm hiW) ^^\\ AW ~B\\%. (1.2) 

Therefore, the speed of ANLS-type methods largely depend on how fast problem (1.2) can be solved. Re- 
cently, an ADM-based algorithm [23] was proposed for solving problem (1.1) and was observed superior to 
many other algorithms for both synthetic data and real image data. Due to non-convexity, the classical 
ADM is not guaranteed to converge. Hence, only a preliminary convergence result was provided under a 
strong assumption that the difference of two consecutive iterates converges to zero. Compared to ANLS-type 
methods, the ADM-based algorithm maintains faster convergence because it does not solve (1.2) exactly but 
instead makes one ADM update during each iteration. Unlike ANLS-type methods, the ADM-based algo- 
rithm is not guaranteed to decrease the objective function value after every iteration. In this paper, we will 
propose an alternating proximal gradient method for solving (1.1). This method maintains fast convergence 
like ADM-based algorithm but also has a better convergence result. 

1.2. Organization. The rest of this paper is organized as follows. An alternating proximal gradient 
method for solving (1.1) is derived and analyzed in Section 2. In order to speed up the algorithm, a Nesterov- 
type acceleration technique is applied. Extensive numerical experiments are done in Section 3. It illustrates 
that the applied acceleration technique indeed speeds up the proposed algorithm a lot. In addition, fast 
convergence is observed by comparing to several other existing methods. Especially for real data tests, 
our method is significantly better than the other compared algorithms. Section 4 concludes this paper by 
extending the same method to a sparse regularized NMF model. 

2. Algorithm and convergence result. Though ANLS-type method spends much time solving sub- 
problem (1.2), it converges within quite a few outer iterations. The fast outer convergence can probably be 
explained that a sufficient decrease of the objective function is obtained after each iteration. In this section, 
we will propose an algorithm which will obtain a sufficient decrease of the objective function after every 
iteration but never spends too much on the subproblem (1.2). Therefore, it will maintain an overall faster 
convergence than ANLS-type methods. 

2.1. Proximal gradient method. Consider the convex optimization problem 

min/(a;), 

where A' is a convex set and / is a smooth convex function with Lipschitz continuous gradient, i.e., 

l|V/(x)-V/(2/)||2 <i/||a:~2/||2, fova\lx,yeX, 

where Lf is the Lipschitz constant. The projected gradient method for (2.1) is 

x''+^ =Vx{x'' -akS/fix'^)). 

Choosing appropiate stepsize ak, the new iteration point x*^^^ can be regarded as a minimizer of the linearized 
proximal regularization of function / at point x'^ , i.e., 

x'^'+i = argmin/(a;'=) + {Vf{x''),x- x*') + ^\\x - x''\\l. 

In general, we can update the iterate by letting 

x^+^ = argmin/(2;'=) + (V/(/),:r - y/^) + :^||:^ _ /||2^ (2.2) 

where is a given point at which / is linearized. The closed form solution of (2.2) is given by 

^fc+i=^^(yfe_V/(/)/L^). 
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(2.1) 



With appropriate choice of y'^ , this method is guaranteed to reach a solution of (2.1). Simply letting y'^ — , 
an e-optimal solution can be obtained within 0(l/e) iterations. With a clever choice of , an e-optimal 
solution can be reached within 0{l/^/e) iterations. For details, please refer to [1, 16]. 
We can use this method to solve subproblem (1.2) by repeatedly updating W by 

W^+^ = max(0, W'' - V/i(M^'^)/L,0, (2.3) 

until convergence, where L^^ is a Lipschitz constant of gradient Vh{W). However, as mentioned above, it 
may take much time to exactly solve the subproblems at each iteration. It turns out that it is sufficient to 
implement one step of update (2.3) for both X-subproblem and y-subproblem at each iteration. Namely, 
we can update X, Y by 

X'^+i =max(0,J^''■-Vxi^(^^>"'')AxO, (2-4a) 
Y^+'^ = max(0,y'' - VyF(X'=+\ y'=)/Lyfc), (2.4b) 

where ix*" a-nd Lyi- are Lipschitz constants of gradient V xF{X,Y'^) and VyF{X'^'^^ ^Y) by regarding F 
as a function with respect to X and Y , respectively. In our algorithm, we will take Lx^ and Lyk as the 
spectral norm of Y^{Y^)^ and (X'^+^)^X'^+^, respectively. Since r <C min{m, n}, this can be done very 
efficiently at each iteration. 

The pseudocode of the above algorithm can be written as follows. 

Algorithm 1 Alternating proximal gradient method for NMF (APG) 
Input; m X n nonnegative matrix M and pre-determined dimension r 
Initialization: randomize A'^ and Y'^ as nonnegative matrices of appropriate size 
for fc = 0, 1,2, ... do 

Update X,Y according to formula (2.4). 
if some stopping criterion is met then 

Stop and output (A'^+i, F'^+i) 
end if 
end for 



2.2. Convergence analysis. The update method (2.2) is analyzed in [1] with X = M". However, it 
is mentioned that their main result Lemma 2.3 can be adapted for the case with convex constriant x € X. 
Therefore, we can get the following result through essentially the same proof of Lemma 2.3 in [1] which we 
omit here. 

Lemma 2.1. Let y^ be a given point in the domain of f and Lf be a Lipschitz constant ofVf. Suppose 
x'^'^^ is obtained by (2.2). Then for any x G X, 

fix) - /(x'^+i) > ^llx'^+i - + L/(/ - - 2/'=). 

Specifically, for updates (2.4), it holds for any nonnegative matrices X,Y that 

F{X,Y^) - F{X''+\y'') > ^IIA'^ - X'^+^Wl + Lx^X^ " X,X''+^ - A'=), 

and 

Based on this lemma, we next provide a convergence result of algorithm APG under some mild assump- 
tions. A point (A, y) satisfies the KKT-condition of (1.1) if 

A (Ay - M)Y^ ==0, A > 0, (Ay - M)Y^ > (2.5a) 

y A^(Ay - M) 0, y>o, A^(Ay-M)>o (2.5b) 
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where denotes component-wise product. 

Theorem 2.2. Suppose the sequence {{X^ ,Y^y\ generated by algorithm APG is uniformly away from 
zero, i.e., there exists L > such that Lx^ > L and Lyk > L. Then any limit point of {{X'^ ,Y'^)} satisfies 
the KKT- conditions (2.5). 

Proof. Replacing X with X'', Y with Y'^ in Lemma 2.1, we obtain that 

FiX^Y"") - F{X''+\Y''+^) > ^\\X''+^ - X'^Wl + ^\\Y''+^ - (2.6) 

which means the objective function value F(X'',Y'^) is nonincreasing. Since F{X'',Y'') is bounded below 
by zero, then it must converge to some real number F > 0. Summing up both sides of inequality (2.6), we 
have 



Fix^^Y"^) -F>Y.( —11^'+' - ^'IIf + ^lir'^+i - Y'^wl 



> ^(||x'=+i ~x''\\l + - Y'^wl), 

where the last inequality follows from Lx^ > L and Lyk > L. Therefore, the differences X'^^^ — X^ and 
yfe+i _ yfc both asymptotically vanish. 

Suppose {X, Y) is a limit point of {{X'^ ,Y^y} . Then there exists a subsequence {{X^^ , Y^^ )} converging 
to {X,Y). Based on the above analysis, the sequence {(X'^^^-^, y'^^"'"^)} also converges to {X,Y). According 
to the update rules (2.4), it holds that 

X^^+^ = max(0,X'=^- - {^x^3Y^^ - M){Y^'Y / L^u^), 
Y^^+^ = max(0,r'=^ - {X^o+^)^ {X^o+'^Y^^ - M)/Ly>^^). 

Letting j ^ oo, we have 

X = max(0, X ~ {XY - M)Y^ /Lx), 
Y = max(0,r- X^(Xy ~ M)/Ly), 

where Lx = Kna^iYY'^) and Ly — Aniax(^^^)- This implies X is a minimizer of the linearized proximal 
regularization of F{X,Y) regarded as a function with respect to X at point X, i.e., 

X = argminF(X,y) + {S/xF{X,Y),X - X) + ^\\X - X\\l,. 



Assume X* is a solution of problem 



mini^(X,r). (2.7) 



x>o 



Using Lemma 2.1 again, we have F{X*,Y) — F{X,Y) > 0, which implies X is also a solution of (2.7). 
Therefore, X must satisfy the KKT-conditions of (2.7) which is exactly (2.5a). In the same way, we can 
prove {X,Y) satisfies (2.5b). This completes the proof. □ 

2.3. Accelerated proximal gradient method. Though the proximal gradient method maintains a 
nice convergence result, it converges relatively slow as it approaches the solution. To speed up the algorithm, 
an extrapolation technique can be applied to accelerate its convergence. Back to the update (2.2), we can 
linearize the function f dX y^ = +u!k{x^ —x^^^) instead of y^ = a;'"', where LxJk > is extrapolation weight. 
The intuitive idea behind this kind of acceleration technique can be explained as follows. If the current iterate 
x^ approaches the optimal point x* more than the previous iterate x^^^ , then the extrapolation point may 
be closer to x* than x^ with appropriate extrapolation weight. See Figure 2.1 for illustration. For convex 
program, specific extrapolation weight can be chosen such that overall acceleration can be reached. For 

example, Wk = (tfe-i — l)/ifc is used and analyzed in [1, 16], where ifc = (1 + ^1 + 4t^_.^)/2 with initial value 
to = 1. This technique called Nesterov-type acceleration technique makes the proximal gradient method 
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Fig. 2.1. u is the previous iterate and v is the current iterate. The extrapolation point v + 0.5(i' — u) becomes closer to 
the optimal point x* than v. 



reach a theoretically best convergence rate of first-order method [15]. It is also observed that an appropriate 
extrapolation works well for non-convex problem. The authors of [21, 14] dynamically chose a sequence 
of extrapolation weights which makes their SOR method significantly outperform the corresponding Gauss 
Seidel method. 

In this section, we will use Nesterov-type acceleration technique to speed up algorithm APG, since 
essentially the algorithm APG alternatively applies proximal gradient method to the subproblems. It turns 
out that the acceleration technique can indeed speed up algorithm APG. Actually, the heuristic extrapolation 
technique used in [21, 14] is also helpful for our algorithm, but not so stable as the Nesterov-type technique 
for problem (1.1). Currently, we are not able to provide a theoretical analysis about why the Nesterov-type 
acceleration technique works so well for non-convex problem. We reserve it as our future work. Incorporating 
the acceleration technique into algorithm APG, we arrive at the following accelerated APG method. 



Algorithm 2 Accelerated alternating proximal gradient method for NMF (AAPG) 
Input: TO X n nonnegative matrix M and pre-determined dimension r 

Initialization: randomize X^^ and as nonnegative matrices of appropriate size. Set X'^ = X 
yo ^ y-i and t_i = 1. 
for fc = 0, 1,2, ... do 



tk = {l + ^l + 4tl_^)/2 (2.8a; 

Wfe = {tk-i - l)/tk (2.8b; 

z';^ = x'' + LOkix'' - x'''^) (2.8c; 

Z^ = +LUkiY'' -Y''-^) (2.8d; 

X'^+i = max(0, Z^ - V xF{Z'' ,Y^)/Lx>'): (2.8e; 

Y^+^ max(0,Z^ - VyF(X'^'+\Z^)/Ly.), (2.8f; 



if some stopping criterion is met then 

Stop and output {X^+^ ,Y^+'^) 
end if 
end for 



To make the objective function value decreasing after each iteration, we can let (Ar'"'+^, y'=+i) = (X'', Y^) 
and tk — tk-i, whenever F(X'^+^, > F{X'', Y'^). Throughout all of our tests, we will apply the above 

modification to algorithm AAPG. However, it performs very robust even without such modification. It 
is observed that the above situation only happens when r becomes relatively large. Figure 2.2 plots the 
comparison results between proximal gradient method and its accelerated version. The tested matrix is in 
the form of M = LR + aN, where L and R are generated by Matlab command rcLndn(m,r) and randn(r,n) 
and then projected to nonnegative matrix cone, respectively, and is a nonnegative Gaussian noise. In 
this test, TO and n are both set to 300, r is set to 20, and noise-level a = 0.05 is chosen for the noise case. 
The relative error is calculated by \\XY — M\\p/\\M\\p. We can see that the convergence speed of APG is 
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Fig. 2.2. Comparison results for algorithm APG and AAPG with random matrix M of size 300 X 300. 



increased at least 3 times with acceleration technique for both noiseless and noise cases. 

3. Numerical experiments. In this section, we will compare the algorithms proposed in last section 
and some state-of-the-art algorithms with both synthetic data and real data. So far, there have been a 
great quantity of algorithms proposed for solving NMF. We do not want to exhaust all these methods 
but only consider several most popular ones or most recent ones that have been shown superior to many 
other algorithms. The first two we will compare with are alternating least square method (Als) [17, 2] and 
multiplicative updating method (mult) [11]. Both of the methods have been written as Matlab functions 
and can be called by nnmf with solver specifier als and mult, respectively. One recent ANLS-type method 
blockpivot is included for comparison. It is shown in [9] that blockpivot outperforms all the other compared 
ANLS-type methods in both speed and solution quality. Another algorithm we want to compare with is 
the recent ADM-based method (ADM) introduced in [23]. Although both blockpivot and ADM have been 
shown superior to Als and mult in many cases, we still include Als and mult for comparison due to their 
popularity and easy availability. 

3.1. Implemention details. From inequality (2.6), it follows that the objective function value is 
nonincreasing. The small change between two consecutive objective function values implies the closeness 
between two iterates. Hence, we will terminate algorithm APG and AAPG whenever 

?±^3±i < toi, 

Fk + l - ' 

where Fk denotes F{X'^,Y'') and tol is a given tolerance. The algorithms are also terminated if a relatively 
small objective function value has been reached, i.e., 

Fk 



\M\ 



< tol. 



In addition, we require the first conditions to be satisfied in at least three consecutive iterations. Here, we 
simply use a same tolerance for both stopping criterions. Of course, we can apply two diffenrent tolerances 
for diffenrent conditions. 

Throughout all of our tests, we will set tol — 10~^ for all algorithms except ADM since 10"'* is somehow 
loose for ADM. Hence, we set tol = 10~^ for ADM. The maximum number of iterations is set to 1000 for all 
algorithms. Same initial values are used for all the methods except mult since mult is very sensitive to initial 
value. Instead, we let mult replicatively do 10 times with a maximum number of iterations set to 10 at the 
beginning and then choose the best output as its initial value. For all the compared methods, all parameters 
are set to their default values and default termination criterions are employed, unless specified otherwise. 
All tests are performed on a Lenovo T410 laptop with an i7-620m CPU and 3-GB RAM and running 32-bit 
Windows 7 and Matlab 2010b. 
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Table 3.1 

Comparison results with nonnegative Gaussian random matrices of size m X n with n fixed at 1000. 





APG 


AAPG 


ADM 


blockpivot 


Als 


mult 


(m, r) 


Rclcrr 


Time 


Rclcrr 


Time 


Rclcrr 


Time 


Rclcrr 


Time 


Rclcrr 


Time 


Relerr 


Time 


(200, 10) 


9.85e-5 


0.60 


9.42e-5 


0.21 


8.41C-5 


0.28 


8.83e-5 


0.87 


7.93e-5 


0.47 


l.lOe-2 


2.37 


(200, 20) 


l.llc-4 


2.89 


9.76C-5 


0.75 


1.41e-4 


0.68 


1.06e-4 


3.40 


7.46e-5 


0.83 


1.20e-2 


5.88 


(200, 30) 


1.46C-2 


5.02 


9.83e-5 


1.63 


1.94e-4 


1.19 


1.15e-4 


6.45 


7.57e-5 


1.40 


2.24e-2 


11.4 


(1000, 10) 


l.OOc-4 


2.53 


9.23e-5 


0.93 


6.50e-5 


1.10 


3.87e-5 


1.58 


5.42e-5 


2.38 


9.81e-3 


11.3 


(1000, 20) 


9.98C-5 


10.5 


9.39e-5 


2.58 


1.09e-4 


2.20 


7.65e-5 


5.74 


3.71e-5 


5.51 


1.04e-2 


26.5 


(1000, 30) 


2.22C-3 


20.4 


9.86e-5 


5.83 


1.65e-4 


3.97 


9.11e-5 


10.8 


3.76e-5 


8.92 


1.26e-2 


50.5 



Table 3.2 

Comparison results on 2000 selected images from CBCL face database with 361 X 2000 tested matrix and dimension r 
varying among {30, 60, 90} 





AAPG 


ADM 


blockpivot 


Als 


mult 


T 


Rclcrr 


Time 


Relerr 


Time 


Relerr 


Time 


Relerr 


Time 


Relerr 


Time 


30 


1.91e-l 


3.68 


1.92e-l 


7.33 


1.90e-l 


21.5 


3.53e-l 


3.15 


2.13e-l 


6.51 


60 


1.42e-l 


12.5 


1.43e-l 


19.5 


1.40e-l 


63.2 


4.59e-l 


1.80 


1.74e-l 


12.1 


90 


1.13e-l 


26.7 


1.15e-l 


34.2 


1.12e-l 


111 


6.00e-l 


2.15 


1.52e-l 


18.4 



3.2. Random matrices. Nonnegative Gaussian random matrices are employed in this subsection for 
comparison. Each matrix can be written as M = LR where L E M™^'' and R e both generated by 
Matlab command randn and then projected to nonnegative matrix cone. Note that the dimension r used here 
is exactly same as that used in the model (1.1). Namely, each tested matrix can be exactly decomposed as the 
product of two low dimensional nonnegative matrices and we use exact rank estimation in this subsection. 
Algorithms are compared with fixed n = 1000 and m chosen from {200, 1000}, r from {10,20,30}. Table 
3.1 lists the detailed results. Each result is average of 10 independent trails. Relerr denotes relative error 
calculated by \\XY — AI\\f /\\M\\p, and time is measured by CPU time (sec). From the table, we can see 
that all methods get quite good solutions except APG for r = 30 and mult for all r's. It is observed that 
APG did not stop within 1000 iterations for r — 30. That is why APG outputs worse solutions at r = 30. 
AAPG and ADM are comparable in both CPU time and solution quality. Blockpivot and Als reach more 
accurate solutions most times but take more time than AAPG and ADM. Since AAPG is always better than 
APG, we will drop APG from our comparison list in the next subsections. 

3.3. Image data. In this subsection, we will test algorithm AAPG, ADM, blockpivot, Als and mult 
on two image databases used in [7, 13]: 

• CBCL database (http://www.ai.mit.edu/projects/cbcl) 

• ORL database (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) 

There are 6977 face images in the training set of CBCL database, with each one of 19 x 19 pixels. For our 
test, 2000 images are selected from the training set. We reshape each image into a column vector and obtain 
a matrix M of size 361 x 2000. Figure 3.1 (a) shows the first 36 faces corresponding to the first 36 columns 
of matrix M. Each face is photoed more than once with difference illuminations or expressions. In this 
test, we vary r in {30,60,90}. The comparison results are shown in Table 3.2. All results are average of 
10 independent trials. From the table, we can see that AAPG outperforms ADM in both time and solution 
quality. Blockpivot gets slightly better solutions than AAPG (less than 1% accurate), but is much slower 
(about 400% more time). Algorithm Als and mult actually both fail in this test. Especially for Als, it 
stagnates at a low-quality solution very early. Figure 3.1 (b)-(f) plots 36 base images corresponding to the 
first 36 columns of basis matrix X for each method at r ~ 90. Each basis matrix X is normalized such that 
the maximum element equals to one before plotting. As shown in [10], NMF can be used to learn parts of 
an object. The successful algorithms AAPG, ADM and blockpivot all get basis matrices containing local 
features of face images. All of the three basis matrices are relatively sparse. The basis matrix X obtained 
by AAPG is the sparsest one with about 24.9% non-zeros after zeroing out all elements below 10~^. There 
are about 31.5% non-zeros for the densest one obtained by ADM. Due to the poor performance of Als and 
mult, we will only consider AAPG, ADM and blockpivot in the following tests to save testing time. 
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Fig. 3.1. CBCL database and comparison base images: (a) 36 images selected from 2000 tested images; (b)-(f) 36 base 
images corresponding to the first 36 columns of basis matrix X obtained from algorithm AAPG, ADM, blockpivot, Als and 
mult at r = 90. 




(d) blockpivot (e) Als (f) mult 



Table 3.3 

Comparison results on all images from ORL face database with 10304 X 400 tested matrix and dimension r varying among 
{30, 60, 90} 





AAPG 


ADM 


blockpivot 


r 


Relerr 


Time 


Relerr 


Time 


Relerr 


Time 


30 


1.67e-l 


15.8 


1.71e-l 


46.5 


1.66e-l 


74.3 


60 


1.41e-l 


42.7 


1.45e-l 


88.0 


1.40e-l 


178 


90 


1.26e-l 


76.4 


1.30e-l 


127 


1.25e-l 


253 



ORL database has 400 images which are divided into 40 groups. Each group has 10 images of one face 
photoed from 10 different directions and with different expressions. We use all these images for test. Each 
image has 112 x 92 pixels and we form each one as a long column vector by stacking all of its columns. Thus 
a marix M of size 10304 x 400 is obtained. Figure 3.2 (a) depicts 50 photos corresponding to the first 50 
columns of M. Again, we choose r from {30, 60, 90}. Average results of 10 independent trials are listed in 
Table 3.3. It demonstrates again that AAPG is better than ADM in both speed and solution quality, and 
AAPG is about 3 times faster than blockpivot at the expense of less than 1% accuracy. Figure 3.2 (b)-(d) 
depict 50 base photos for each algorithm at r = 90 corresponding to the first 50 columns of basis matrix 
X. This time, all the three algorithms get the frames of face images instead of local features. None of them 
get a sparse basis matrix X. The sparsest one obtained by blockpivot has 60.6% non-zeros after setting all 
elements below 10~^ to zero. 

3.4. Hyperspectral data. It is shown in [18] that NMF (1.1) can be employed to spectral data 
analysis. A regularized NMF model is also considered in that paper with penalty terms q;||X|||, and 
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Fig. 3.2. ORL database and comparison base images: (a) 50 images selected from 400 tested images; (b)-(d) 50 base 
images corresponding to the first 50 columns of basis matrix X obtained from algorithm AAPG, ADM and blockpivot atr = 90. 
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(a) ORL data (b) AAPG (c) ADM (d) blockpivot 



Fig. 3.3. Hyperspectral data: four selected slices. 
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added in the objective function. The parameters a and /3 can be tuned for specific purpose in practice. 
However, we only focus on the original NMF model (1.1) in this section and our purpose is to show the 
effectiveness of our algorithm. It is worth mentioning that our method can easily be extended to such a 
regularized NMF model. As shown in the last section, our method works quite well for sparse-NMF, with 
sparse regularized terms a||X||i and added in the objective function. In this subsection, we will use 

a 150 X 150 X 163 hyperspectral cubic data to test the compared algorithms for solving (1.1). Each slice is 
reshaped as a long column vector by stacking its columns and the cube is scaled such that the maximum 
element equals to one. In this way, a 22500 x 163 matrix M is obtained. Four selected slices are shown in 
Figure 3.3 corresponding to the 1st, 50th, 100th and 150th columns of M. The dimension r is chosen from 
{20,30,40,50} for test. Average results of 10 independent trials are shown in Table 3.4. We can see from 
the table that AAPG is superior to ADM and blockpivot in both time and solution quality. 

4. Conclusion and discussions. This paper proposed an alternating proximal gradient method for 
solving nonnegative matrix factorization problem. Under some mild assumptions, it is proved that any 
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Table 3.4 

Comparison results on hyperspectral cube of size 150 X 150 X 163 with dimension r varying among {20, 30, 40, 50} 





AAPG 


ADM 


blockpivot 


r 


Relerr 


Time 


Relerr 


Time 


Relerr 


Time 


20 


1.18e-2 


34.2 


2.34e-2 


87.5 


1.38e-2 


62.5 


30 


9.07e-3 


63.2 


2.02e-2 


116 


l.lOe-2 


143 


40 


7.56e-3 


86.2 


1.78e-2 


140 


9.59e-3 


194 


50 


6.45e-3 


120 


1.58e-2 


182 


8.e-3 


277 



limit point of the iterates generated by this method satisfies first-order optimahty conditions. However, this 
method converges quite slow as the iterate approaches to the solution. Inspired by the fact that a Nesterov- 
type acceleration technique can significantly speed up the proximal gradient method for convex program, we 
applied this extrapolation technique to the non-convex problem. It turns out that the extrapolation helps 
a lot to accelerate the proposed method. Extensive numerical experiments illustrate the effectiveness of the 
accelerated alternating proximal gradient method (AAPG). The test on synthetic data shows that AAPG is 
comparable with other state-of-the-art algorithms in both speed and solution quality. All tests on real data 
demonstrate that AAPG has a significant advantage in speed with comparable accuracy as other methods. 
Especially for the hyperspectral data test, AAPG shows its superiority in both speed and solution quality 
among the compared algorithms. 

Before we finish this paper, we would like to show that the accelerated alternating proximal gradient 
method can be easily applied to the following sparse-NMF model 

mm^^\\XY-M\\% + a\\X\U+[3\\Y\U, (4.1) 

where ||Ar||i = J^ij Due to nonnegativity, ||^||i — -^ij- J^^^ ^^w we derived APG in Section 2, 

we can alternatively approximate the objective function of (4.1) by a quadratic function and then minimize 
the quadratic approximation function with respect to X and Y at each iteration. Similar to (2.8), all the 
updates can be written in a closed form as follows. 



tk = (l + ^l + 4tl_,)/2 (4.2a) 

ujk = {tk-i - l)/tk (4.2b) 

Z^x = + ^k{X^ - X^^^) (4.2c) 

Z^ = Y^ + i^k{Y^ - Y^-^) (4.2d) 

X^+i =max(0,Zi-(VxF(Z^y'=) + a)/L;,O, (4.2e) 

r'^+i = max(0, - {VyF{X''+\Z^) + /3)/LyO, (4-2f) 



where F and Lxk^Lyk have the same meanings as in Section 2. The method (4.2) is named as algorithm 
SAAPG. We compared algorithm AAPG, SAAPG, ADM and blockpivot on 500 x 500 random matrices with 
sparse basis X. Each matrix can be written as M = Li?, where L G M™^'' is generated by calling Matlab 
command sprauidn with density argument p = 0.1,0.15,0.20 and R S is generated by calling randn. 

Parameters a and /3 are set to 0.025p and for SAAPG, respectively. All other settings are the same as in 
Section 3. Table 4.1 lists the comparison results, all of which are average of 10 independent trials. Relerr 
and time have the same meaning as in Section 3. Density denotes the percentage of non-zeros in the basis 
matrix X after setting elements below 10~^ to zero. Though the density parameter is set to p, the actual 
percentage of non-zeros of matrix L is usually less than p. That is why SAAPG and blockpivot can get 
sparser basis matrix X than the specified density level p. We can see from the table that all algorithms get 
high-quality solutions, except SAAPG and blockpivot failed once when r = 30,p = 0.10. With the sparse 
regularized term, SAAPG indeed obtains sparser solution than what AAPG gets without the penalty term 
and SAAPG gets sparsest basis in average for all cases. 
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Table 4.1 

Comparison results on 500 x 500 Gaussian random matrices with sparse basis 





AAPG 


SAAPG 


ADM 


blockpivot 


ir,p) 


Rclcrr 


Time 


Density 


Relcrr 


Time 


Density 


Relcrr 


Time 


Density 


Relcrr 


Time 


Density 


(10, 0.10) 


6.06e-5 


0.16 


0.1515 


5.72e-5 


0.17 


0.0702 


3.39e-5 


0.15 


0.2645 


1.52e-5 


0.45 


0.0951 


(10, 0.15) 


5.60e-5 


0.17 


0.1879 


5.87e-5 


0.18 


0.0924 


3.88e-5 


0.12 


0.4351 


1.36e-5 


0.41 


0.1342 


(10, 0.20) 


5.77C-5 


0.16 


0.2691 


5.71e-5 


0.17 


0.1126 


2.74e-5 


0.12 


0.5053 


1.72e-5 


0.46 


0.1732 


(20, 0.10) 


7.07C-5 


0.40 


0.1525 


7.25e-5 


0.40 


0.0694 


8.39e-5 


0.26 


0.5496 


3.45e-5 


1.03 


0.1071 


(20, 0.15) 


6.79C-5 


0.38 


0.2151 


6.94e-5 


0.40 


0.0981 


9.34e-5 


0.25 


0.6846 


3.51e-5 


1.12 


0.1428 


(20, 0.20) 


7.02e-5 


0.42 


0.2647 


6.91e-5 


0.41 


0.1195 


9.93e-5 


0.25 


0.8223 


3.53e-5 


1.29 


0.1610 


(30, 0.10) 


8.07e-5 


0.84 


0.1342 


4.93e-3 


0.77 


0.0809 


2.28e-4 


0.51 


0.7098 


5.06e-3 


1.65 


0.1097 


(.30. 0.15) 


8.04e-5 


0.77 


0.1913 


7.77e-.5 


0.83 


0.0987 


9.38e-5 


0.46 


0.7720 


4.41C-5 


1.99 


0.1306 


(30, 0.20) 


7.7(ic-r) 


0.79 


0.2010 


(i.OiSo-.j 


O.Hl 


0.1202 


7.370-.') 


0.11 


0.9013 


5.900-5 


2.09 


0.1522 
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